Abstract

Reliable storage emulations from fault-prone components have established themselves as an algorithmic
foundation of modern storage services and applications. Most existing reliable storage emulations
are built from storage services supporting arbitrary read-modify-write primitives. Since such primitives
are not typically exposed by pre-existing or off-the-shelf components (such as cloud storage services or
network-attached disks), it is natural to ask if they are indeed essential for efficient storage emulations.
In this paper, we answer this question in the affirmative. We show that relaxing the underlying storage
to only support read/write operations leads to a linear blow-up in the emulation space requirements. We
also show that the space complexity is not adaptive to concurrency, which implies that the storage cannot
be reliably reclaimed even in sequential runs. On the positive side, we show that Compare-and-Swap
primitives, which are commonly available with many off-the-shelf storage services, can be used to emulate a
reliable multi-writer atomic register with constant storage and adaptive time complexity.
1 Introduction


Reliable storage emulations seek to construct fault-tolerant shared primitives, such as read/write registers, 
from a collection of failure-prone components, such as storage servers, or network-attached disks. These 
emulations are core enablers of many modern storage services and applications, such as cloud and online 
data stores [?] and Storage-as-a-Service offerings [6, 7, 8, 9].

Most existing emulation algorithms are constructed from storage services capable of supporting
custom-built read-modify-write (RMW) primitives [10, 11, 12, 13, 14, 15]. For example, the ABD
algorithm [10], emulating a fault-tolerant atomic read/write register from crash-prone nodes, assumes that each
node has an ability to test and update the stored data along with its associated metadata in a single atomic
step. In reality, though, reliable storage services must often be built from pre-existing or off-the-shelf building
blocks (such as network-attached disks or cloud storage services), which typically offer a fixed collection of
read/write capabilities, sometimes augmented with simple conditional update primitives similar to
Compare-and-Swap (CAS).

In this paper, we study the question of what minimal functionality must be supported by fault-prone 
storage nodes to enable space-efficient emulations of reliable storage primitives. We start by considering 
storage servers equipped with read/write primitives, which we abstract as read/write atomic registers. A 
notable prior work assuming a similar setting is Disk Paxos [?], which builds a reliable consensus service
from crash-prone network-attached disks. Interestingly, in Disk Paxos, each client is allocated a dedicated
register on each server, which naturally leads to the question of whether linear space is necessary for
constructing reliable multi-writer storage from fault-prone read/write primitives.

In Section 3, we prove that this is indeed inherent: implementing a reliable multi-writer read/write
register for k clients from a collection of multi-writer multi-reader (MWMR) atomic read/write registers
hosted on crash-prone servers requires at least $kf$ registers, where f is the maximum number of tolerated
server failures. We further show that no such algorithm can have its storage consumption adaptive to
concurrency, which implies that the storage costs cannot be further optimized (e.g., by reclaiming old
values) even in sequential runs. Since the registers can be assigned to the servers in a variety of ways,
we further restrict possible assignments by showing that if the number of registers per server is
bounded by a known constant m, then supporting $\ell m$ clients requires $f + 1$ more servers in addition to the
requisite $\ell f$ servers stipulated by our storage bound. Our bounds apply to any fault-tolerant implementation
of an MWMR register that is at least single-writers safe (a consistency notion weaker than the standard
multi-writer safety [11], [?]) and solo-terminating (a weak liveness condition where only the operations
that eventually run in isolation are required to terminate).

We prove our results in a fault-prone shared memory model which faithfully captures the
settings where constituent storage services are provided as pre-existing building blocks. Our impossibility
proofs employ a variation of a covering argument [22] to construct a sequential run where f new registers
become covered with each consecutive write invoked by a client, thus gradually exhausting the available
storage capacity.

Understanding the cost of using read/write primitives, we turn our attention to identifying a simple
RMW primitive that can be used to efficiently support a reliable emulation. We focus on
Compare-and-Swap (CAS), which closely matches a variety of conditional write primitives available with many of
today's cloud storage service interfaces [?]. In Section 4, we present a constant-space emulation
of an MWMR atomic read/write register that utilizes a single CAS object per server and tolerates up to
a minority of server crashes. Our emulation is derived in a modular fashion by first constructing the ABD
update primitive from a single CAS object, and then plugging the resulting construction into the multi-writer
ABD emulation [10], [?]. We show that the time complexity of our implementation matches that of ABD in
contention-free runs and, in the worst case, is adaptive to the number of concurrently executing clients.

2 Preliminaries 

2.1 Model 

We consider an asynchronous fault-prone shared memory system [19] consisting of a set of base objects
$\mathcal{B} = \{b_1, b_2, \ldots\}$. The objects are accessed by clients from some set
$\mathcal{C} = \{c_1, c_2, \ldots\}$. The clients interact
with base objects via a set of operations supported by the objects. We will consider base objects supporting
either simple read and write (i.e., read/write registers) or compare-and-swap (CAS) operations.

We consider a slight generalization of the model in [19] where the objects are mapped to a set
$\mathcal{S} = \{s_1, s_2, \ldots\}$ of servers via a function $\delta$ from $\mathcal{B}$ to $\mathcal{S}$.
For $B \subseteq \mathcal{B}$, we will write $\delta(B)$ to denote the image of
B, i.e., $\delta(B) = \{\delta(b) : b \in B\}$. Conversely, for $S \subseteq \mathcal{S}$, we will write
$\delta^{-1}(S)$ to denote the pre-image of S, i.e., $\delta^{-1}(S) = \{b : \delta(b) \in S\}$.
Both servers and clients can fail by crashing. A crash of a server causes all
objects mapped to that server to instantaneously crash.¹

We study algorithms that emulate shared read/write registers to a set of clients. Clients interact with the 
emulated register via high-level read and write operations. To distinguish the high-level emulated reads and 
writes from low-level base object access, we refer to the former as READ and WRITE. We say that high-level 
operations are invoked and return whereas low-level operations are triggered and respond. A high-level 
operation consists of a series of trigger and respond actions on base objects, starting with the operation’s 
invocation and ending with its return. Since base objects are crash-prone, clients must be able to continue 
executing without awaiting responses to previously issued operations. Thus, the trigger actions occur locally 
at clients without involving any actual interaction with their target base objects. Once triggered, a low-level
operation can take effect on (or be applied to) the base object state, followed by a response being returned
to the client.

An algorithm A defines the behavior of clients as deterministic state machines where state transitions 
are associated with actions, such as trigger/response of low-level operations. A configuration is a mapping 
to states from system components, i.e., clients and base objects. An initial configuration is one where all 
components are in their initial states. 

A run of algorithm A is a (finite or infinite) sequence of alternating configurations and actions, beginning
with some initial configuration, such that configuration transitions occur according to A. We use the notion
of time t during a run r to refer to the configuration reached after the t-th action in r. A run fragment
is a contiguous sub-sequence of a run. A run is write-only if it has no invocations of the high-level read
operations.

We say that a base object, client, or server is faulty in a run r if it fails at some time in r, and correct
otherwise. A run is fair if (1) for every low-level operation triggered by a correct client on a correct base
object, there is eventually a matching response, and (2) every correct client gets infinitely many opportunities
to both trigger a low-level operation and execute the return actions. We say that a low-level operation on a
base object is pending in run r if it was triggered but has no matching response in r.

We say that a high-level operation $op_i$ precedes a high-level operation $op_j$ in a run r, denoted
$op_i \prec_r op_j$, if $op_i$ returns before $op_j$ is invoked in r. Operations $op_i$ and $op_j$ are
concurrent in a run r if neither one precedes the other. A run with no concurrent operations is sequential.
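
The precedence relation can be illustrated with a minimal Python sketch. It is not part of the algorithms studied in this paper: the `Op` record, its integer time fields, and the function names are assumptions introduced only for illustration, and pending operations are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """A completed high-level operation (hypothetical record for illustration)."""
    name: str
    invoked: int   # time of the invocation action
    returned: int  # time of the return action


def precedes(a: Op, b: Op) -> bool:
    # a precedes b if a returns before b is invoked
    return a.returned < b.invoked


def concurrent(a: Op, b: Op) -> bool:
    # two operations are concurrent if neither precedes the other
    return not precedes(a, b) and not precedes(b, a)


def is_sequential(ops) -> bool:
    # a run is sequential if it contains no pair of concurrent operations
    return all(
        not concurrent(a, b)
        for i, a in enumerate(ops)
        for b in ops[i + 1:]
    )
```

For instance, two writes occupying the intervals [0, 3] and [4, 6] form a sequential run, whereas adding a read over [5, 8] makes the run non-sequential because the read overlaps the second write.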

¹ Note that the original faulty shared memory model of [19] can be derived from our model by choosing $\delta$ to be an injective function.
2.2 Storage Service Definitions


We study storage services emulating a multi-writer/multi-reader (MWMR) register, which stores values from
a domain V, and offers an interface for invoking read and write operations. Initially, the register holds some
distinguished initial value $v_0 \in V$. The sequential specification of the register is as follows: A read returns
the latest written value, or $v_0$ if none was written.

Liveness We consider the following liveness conditions that must be satisfied in fair runs of an emulation
algorithm. A wait-free object is one that guarantees that every high-level operation invoked by a correct
client eventually returns, regardless of the actions of other clients. A solo-terminating object guarantees that
every high-level operation that takes steps in isolation eventually returns.

Safety Two runs are equivalent if every client performs the same sequence of high-level operations in both,
where operations that are pending in one can be either included (with some response) in or excluded from
the other. A linearization of a run r is an equivalent sequential run that satisfies r's operation precedence
relation and the object's sequential specification.

We consider the following safety requirements for an emulation algorithm. A run of the emulation
algorithm satisfies atomicity if it has a linearization. An emulated object is atomic (or, linearizable) if all
its runs satisfy atomicity. For our storage lower bound, we will also consider the following weak safety
guarantee: A run r of the MWMR emulation algorithm is single-writers if no two write operations overlap
in r; i.e., for any two distinct writes $W_i$ and $W_j$ in r, either $W_i \prec_r W_j$ or $W_j \prec_r W_i$.
A run r of the MWMR register emulation algorithm satisfies safety [11] if for every read rd that returns in r
and does not overlap any writes, there exists a linearization $L_{rd}$ of the subsequence of r consisting of
all write operations in r and rd. An emulated MWMR register is single-writers safe (SW-safe) if all its
single-writers runs satisfy safety.

For our space lower bound, we will restrict our attention to single-reader (SR) emulations where only a 
single designated client is allowed to read the emulated register. 

Fault-Tolerance The emulation algorithm is f-tolerant if it remains correct (in the sense of its safety and
liveness properties) as long as at most f servers crash for a fixed f > 0.

Complexity measures The resource consumption of an emulation algorithm A in a (finite) run r is the
number of base objects used by A in r. The resource complexity [?] of A is the maximum resource
consumption of A in all its runs. To measure running time, we assume that each operation triggered on a
base object takes at most one unit of time to complete, and the local computation delays are negligibly small.
The (asynchronous) time complexity of A [?] is then the maximum time required by any client to complete
the high-level object invocation.

Adaptivity to Contention Given a run fragment r of an emulation algorithm, the point contention [25]
of r, PntCont(r), is the maximum number of clients that have an incomplete high-level invocation after
some finite prefix of r. Similarly, we use PntCont(op) to denote PntCont($r_{op}$), where $r_{op}$ is the run
fragment including all events between op's invocation and response.

The resource complexity of A is adaptive to point contention if there exists a function M such that after
all finite runs r of A, the resource consumption of A in r is bounded by M(PntCont(r)). Likewise, the time
complexity of A is adaptive to point contention if there exists a function T such that for each client $c_i$ and
operation op, the time to complete the invocation of op by $c_i$ is bounded by T(PntCont(op)).
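
As an illustration of the point contention measure, the following minimal Python sketch computes it for a run fragment given as an ordered list of high-level events. The event encoding (pairs of a client name and the string "invoke" or "return") and the function name are assumptions made for this example only.

```python
def point_contention(events):
    """Maximum number of clients with an incomplete high-level invocation
    after any prefix of the given run fragment.

    events: iterable of (client, kind) pairs in run order, where kind is
    "invoke" or "return" (a hypothetical encoding of high-level actions).
    """
    active = set()
    highest = 0
    for client, kind in events:
        if kind == "invoke":
            active.add(client)
        elif kind == "return":
            active.discard(client)
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
        highest = max(highest, len(active))
    return highest
```

In a sequential run, every prefix has at most one incomplete invocation, so the function returns 1 no matter how many distinct clients take part. This is the situation exploited by the adaptivity lower bound.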

3 Resource Complexity of Emulating SW-Safe MWSR Register 

In this section, we prove that any f-tolerant emulation of a solo-terminating multi-writer/single-reader
(MWSR) SW-safe register for k clients from a collection of MWMR atomic registers stored on
crash-prone servers has resource complexity at least $kf$. As there are many possible ways in which these $kf$
registers can be mapped to the given set of servers, we further restrict possible mappings by showing that if the
number of registers assigned to each server is at most m, then for any $\ell > 0$, the number of servers required
to support $\ell m$ clients is at least $\ell f + f + 1$. In other words, supporting that many clients requires
$f + 1$ extra servers in addition to the $\ell f$ stipulated by our resource complexity bound. For completeness,
we will also show that $2f + 1$ servers are necessary regardless of the individual server capacities, though this
bound can also be derived from well-known results (e.g., [?], [17]). Our last result shows that the emulation
resource complexity cannot be adaptive to point contention.

Our proof exploits the fact that the environment is allowed to prevent a pending low-level write from
taking effect on the base object states for arbitrarily long. As a result, a client cannot reliably store a value in
a base register having a pending write (by a different client), as this write may take effect at a later time, thus
erasing the stored value. We will reuse the terminology of [22] and refer to a pending write operation W
on some base register b as a covering write, and to b as being covered by W.
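
The effect of a covering write can be illustrated with a small Python sketch. It is a toy model rather than the formal one: the `BaseRegister` class, its `pending` list, and the `apply` method (standing in for the environment letting a triggered write take effect) are all names assumed for this example.

```python
class BaseRegister:
    """Toy read/write base register with explicitly delayed writes."""

    def __init__(self, initial):
        self.value = initial
        self.pending = []  # triggered writes that have not yet taken effect

    def trigger_write(self, client, value):
        # Triggering is local to the client; the register state is unchanged.
        self.pending.append((client, value))

    def apply(self, client):
        # The environment lets the oldest pending write of `client` take effect.
        for index, (owner, value) in enumerate(self.pending):
            if owner == client:
                self.value = value
                del self.pending[index]
                return
        raise KeyError(f"no pending write by {client}")

    def is_covered(self):
        return bool(self.pending)


def covering_demo():
    b = BaseRegister("v0")
    b.trigger_write("c1", "v1")   # delayed by the environment: b is now covered
    b.trigger_write("c2", "v2")
    b.apply("c2")                 # c2's value is stored ...
    after_c2 = b.value
    b.apply("c1")                 # ... until the covering write takes effect
    after_c1 = b.value
    return after_c2, after_c1
```

Here the value stored by the second client survives only until the delayed write of the first client is applied, which is why a writer cannot rely on a register that is covered by another client.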

For any time t (following the t-th action) in a run r of the emulation algorithm, we define the following:

• $C(t)$: the set of clients that have completed a high-level write operation on the emulated register at
time $\le t$.

• $Cov(t)$: the set of the base registers that have a covering low-level write at time t.

We first prove the following key lemma: 

Lemma 3.1 For all $F \subseteq \mathcal{S}$ such that $|F| = f$, there exists a write-only sequential run $r_i$
of an f-tolerant algorithm that emulates an SW-safe solo-terminating MWSR register consisting of $i \ge 0$
complete high-level writes of values $v_1, \ldots, v_i$ by i distinct clients $c_1, \ldots, c_i$, and $t_i$ steps
such that $|Cov(t_i)| \ge if$ and $\delta(Cov(t_i)) \cap F = \emptyset$.

We construct $r_i$ inductively as follows. First, it is easy to see that a run $r_0$ consisting of $t_0 = 0$ steps
satisfies the lemma. Next, fix an arbitrary set of servers F such that $|F| = f$, and assume that $r_{i-1}$ exists
for $i > 0$. We show how $r_{i-1}$ can be extended up to time $t_i > t_{i-1}$ so that the lemma holds for the
resulting run.

We introduce the following notation for all times $t \ge t_{i-1}$:

• $Tr_i(t)$: the set of base registers on which a low-level write was triggered between $t_{i-1}$ and t.

• $Cov_i(t) = Cov(t) \setminus Cov(t_{i-1})$: the set of base registers that have been newly covered between
$t_{i-1}$ and t. Note that $Cov_i(t) \subseteq Tr_i(t)$.

• $Q_i(t) \subseteq \mathcal{S}$: the set of servers such that $Q_i(t) = \delta(Cov_i(t)) \setminus F$ if
$|\delta(Cov_i(t)) \setminus F| \le f$, and $Q_i(t) = Q_i(t - 1)$ otherwise.

We will define the following adversarial behaviour of the environment, which, whilst being tolerated by
the algorithm, causes it to consume a gradually growing amount of the storage resources:

Definition 3.2 ($Ad_i$): At any time $t \ge t_{i-1}$, prevent the following writes from taking effect on the base
register states:

1. all covering writes by clients in $C(t_{i-1})$, and
2. all covering writes on the base registers in $\delta^{-1}(Q_i(t))$.

Observation 1 If the environment behaves like $Ad_i$, then for all $t \ge t_{i-1}$, $Q_i(t) \subseteq Q_i(t + 1)$.

We first show that $r_{i-1}$ can be extended with a complete high-level write $W_i$ by a new client $c_i$ such
that the environment behaves like $Ad_i$ until $W_i$ returns. Intuitively, this means that $Ad_i$ delays applying
low-level writes triggered by $c_i$ on at most f servers as well as the past covering writes. As a result, $c_i$ cannot
distinguish this scenario from the one where all the involved servers and clients have crashed, and therefore,
by solo-termination, must return without receiving the delayed replies.

Lemma 3.3 Suppose that the environment behaves like $Ad_i$, and let $W_i$ be a high-level write invocation by
client $c_i \notin C(t_{i-1})$. Then, there exists a time $t_r > t_{i-1}$ at which $W_i$ returns while the environment
continues to behave like $Ad_i$ until $t_r$.

Proof: By definition of $Ad_i$, there exists a time $t_f > t_{i-1}$ such that for all times $t \ge t_f$,
$Q_i(t) = Q_i(t_f)$. If $W_i$ returns before $t_f$, then $t_r = t_f$ satisfies the lemma. Otherwise, for each server
$s \in Q_i(t_f)$, let $t_s$ be the earliest time such that $s \in Q_i(t_s)$. Since by Observation 1
$Q_i(t) \subseteq Q_i(t_f)$ for all $t \le t_f$, $s \in Q_i(t)$ for all $t \ge t_s$.

Let r' be a fair run that includes the same sequence of steps as $r_{i-1}$ up to time $t_f$ and in which, in addition,
each server $s \in Q_i(t_f)$ fails immediately after step $t_s$, and each client $c_1, \ldots, c_{i-1}$ fails before
any of its covering writes on registers in $Cov(t_{i-1})$ takes effect on the register states. Since r' is fair, by
f-tolerance and solo-termination, there exists a time t' at which $W_i$ returns in r'. Since $r_{i-1}$ is
indistinguishable from r' to $c_i$ for the entire duration of $W_i$, it must return in $r_{i-1}$ at time $t_r = t'$
as well. □

We next show that in order to guarantee correctness in the face of the environment behaving like $Ad_i$,
$W_i$ must trigger a low-level write on at least one non-covered base register on each server in a set of
$2f + 1$ servers.

Lemma 3.4 Let $W_i$ be a high-level write invocation by client $c_i \notin C(t_{i-1})$ that returns at time
$t_r > t_{i-1}$, and suppose that the environment behaves like $Ad_i$ until $t_r$. Then,
$|\delta(Tr_i(t_r) \setminus Cov(t_{i-1}))| > 2f$.

Proof: Denote $M = \delta(Tr_i(t_r) \setminus Cov(t_{i-1}))$, and assume by contradiction that $|M| \le 2f$.
Let $S_1 = M \cap F$, $S_2 = Q_i(t_r)$, and $S_3 = M \setminus (S_1 \cup S_2)$. Note that $S_1, S_2, S_3$ are
pairwise disjoint, $M = S_1 \cup S_2 \cup S_3$, and by definition of $Q_i(t_r)$, and since $|F| = f$,
$|S_1 \cup S_3| = |S_1| + |S_3| \le f$.

Let r be a run that is identical to $r_{i-1}$ up to time $t_{i-1}$, after which all the covering writes in $r_{i-1}$
take effect on register states, and all servers in the set $S_1 \cup S_3$ crash. Extend r with an invocation of a
high-level read operation R by client $c_{rd} \ne c_i$. Since r is fair, by solo-termination and f-tolerance, there
exists a time $t_{rd} > t_{i-1}$ at which R returns. Since r is single-writers, by SW-safety, R must return $v_{i-1}$.

Let r' be a run that is identical to $r_{i-1}$ up to time $t_r$, after which it is extended to time $t' > t_r$
by having all servers in the set $S_1 \cup S_3$ crash and the covering writes in $r_{i-1}$ take effect on the base
register states. As a result, the values stored in the registers in $Cov(t_{i-1})$ are now identical to those in r.
Furthermore, since $Ad_i$ prevents all low-level writes triggered on registers in $\delta^{-1}(S_2)$ from taking
effect before $t_r$, their values are also the same as those in r. Thus, at t', all registers in both r and r' have
the same content.

We extend r' by having client $c_{rd} \ne c_i$ invoke a high-level read R while allowing the environment to
continue preventing all covering writes by client $c_i$ on the registers in $\delta^{-1}(S_2)$ from taking effect on
their states. Since r' is indistinguishable from r to $c_{rd}$, the sequence of steps executed by $c_{rd}$ in r' is
the same as that in r. Hence, R returns $v_{i-1}$ in r'. However, since $W_i$ is the last complete write preceding R
in r', by SW-safety, R's return value must be $v_i \ne v_{i-1}$. A contradiction. □


The following two corollaries follow immediately from Lemmas 3.3 and 3.4.

Corollary 3.5 Let $W_i$ be a high-level write invocation by client $c_i \notin C(t_{i-1})$ that returns at time
$t_r > t_{i-1}$, and suppose that the environment behaves like $Ad_i$ until $t_r$. Then, $|Q_i(t_r)| = f$.

Corollary 3.6 For all $i > 0$, $|\mathcal{S} \setminus \delta(Cov(t_{i-1}))| > 2f$.

We are now ready to prove Lemma 3.1.

Proof: [of Lemma 3.1] By Lemma 3.3, $r_{i-1}$ can be extended with a complete high-level write $W_i$ by client
$c_i \ne c_{i-1}$ writing a value $v_i \ne v_{i-1}$ while allowing the environment to behave like $Ad_i$ until
time $t_r$ when $W_i$ returns. We further extend $r_{i-1}$ by allowing the environment to behave like $Ad_i$ until
time $t' > t_r$ when all writes triggered after $t_{i-1}$ on the registers in $\delta^{-1}(F)$ take effect. Hence,
$F \cap \delta(Cov_i(t')) = \emptyset$.

Since by Corollary 3.5 $|Q_i(t_r)| = f$, and by Observation 1 $Q_i(t_r) \subseteq Q_i(t')$, we have
$|Q_i(t')| = f$, and therefore $|Cov_i(t')| \ge f$. Now since $Cov_i(t')$ and $Cov(t_{i-1})$ are disjoint,
$Cov(t') = Cov(t_{i-1}) \cup Cov_i(t')$, and by the induction hypothesis $|Cov(t_{i-1})| \ge (i - 1)f$ and
$\delta(Cov(t_{i-1})) \cap F = \emptyset$, we obtain $|Cov(t')| \ge (i - 1)f + f = if$, and
$\delta(Cov(t')) \cap F = (\delta(Cov(t_{i-1})) \cap F) \cup (\delta(Cov_i(t')) \cap F) = \emptyset$.
Thus, $t_i = t'$ satisfies the lemma. □

Resource Complexity The following theorem follows immediately from Lemma 3.1 (please see Section A
of the Appendix for a full proof):

Theorem 3.7 For any k > 0, f > 0, there is no f-tolerant algorithm emulating an SW-safe solo-terminating
MWSR register for k clients using fewer than $kf$ base registers.

Number of Servers We now turn our attention to deriving the number of servers required for supporting
the emulation. The following result follows immediately from Corollary 3.6 (please see Section A of the
Appendix for a full proof), but can also be derived from well-known results in the literature (e.g., [?]).

Theorem 3.8 For any k > 0 and f > 0, there is no f-tolerant algorithm emulating an SW-safe
solo-terminating MWSR register for k clients with fewer than $2f + 1$ servers.

Next, we show that if the storage per server is bounded by a known constant, an extra $f + 1$ servers
beyond the minimum capacity established by Theorem 3.7 are necessary to accommodate a given number
of clients.

Theorem 3.9 For any m > 0, $\ell > 0$, and f > 0, there is no f-tolerant algorithm emulating an SW-safe
solo-terminating MWSR register for $k \ge \ell m$ clients using fewer than $\ell f + f + 1$ servers if each server can
store at most m registers.

Proof: Assume by contradiction that there exists an f-tolerant algorithm A emulating an SW-safe solo-terminating
MWSR register for $k = \ell m$ clients using $\ell f + f$ servers. Fix a set $F \subseteq \mathcal{S}$ such that
$|F| = f$, and let $N \le mf$ be the number of registers mapped to the servers in F.

By Lemma 3.1, there exists a run $r_{k-1}$ of A consisting of $k - 1 = \ell m - 1$ high-level writes by $k - 1$
distinct clients such that by the end of $r_{k-1}$, the number of distinct base registers having a covering write is
at least $(k - 1)f$, and no registers in $\delta^{-1}(F)$ have a covering write. Thus, the number of registers that
remain not covered by the end of $r_{k-1}$ is at most
$\ell f m + N - (k - 1)f = \ell f m + N - \ell f m + f = N + f = R$.

Now since no register in $\delta^{-1}(F)$ has a covering write, N out of the total R registers must be mapped to the
f servers in F. And since the remaining f registers can be mapped to at most f servers, by the end of $r_{k-1}$,
the total number of servers that may have a register without a covering write is at most $2f$. A contradiction
to Corollary 3.6. □
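
The three bounds of this section can be evaluated for concrete parameters with the following minimal Python sketch. The function names are introduced here for illustration only; `servers_for_capacity` is an auxiliary calculation (not a result of the paper) showing that merely hosting $kf$ registers at m registers per server already takes $\ell f$ servers when $k = \ell m$.

```python
import math


def min_registers(k, f):
    # Theorem 3.7: at least k*f base registers for k clients
    return k * f


def min_servers(f):
    # Theorem 3.8: at least 2f + 1 servers regardless of server capacity
    return 2 * f + 1


def min_servers_bounded(ell, m, f):
    # Theorem 3.9: at least ell*f + f + 1 servers for k >= ell*m clients
    # when each server stores at most m registers (m does not appear in
    # the bound itself; it only determines the number of clients)
    return ell * f + f + 1


def servers_for_capacity(k, m, f):
    # servers needed just to host k*f registers at m registers per server
    return math.ceil(k * f / m)
```

For example, with f = 2, m = 3, and $\ell = 4$ (so k = 12 clients), at least 24 registers are needed, hosting them takes 8 servers, and Theorem 3.9 raises the requirement to 11 servers, i.e., $f + 1 = 3$ more than the capacity alone would suggest.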

Adaptivity We show that no SW-safe solo-terminating MWSR register can have a fault-tolerant emulation
adaptive to point contention:

Theorem 3.10 For any f > 0, there is no f-tolerant algorithm that emulates an SW-safe solo-terminating
MWSR register with resource complexity adaptive to point contention.

Proof: Pick an arbitrary f > 0, and assume by contradiction that such an algorithm A exists. By
Lemma 3.1, there exists a run r of A consisting of k high-level writes by k distinct clients such that the
resource consumption grows by f for each consecutive write that completes in r, whereas the point
contention remains equal to 1 for the entire r. We conclude that no function mapping point contention to resource
consumption can exist, and therefore, A's resource complexity is not adaptive to point contention. A
contradiction. □


4 Atomic Register Implementation 

In this section we present a space-efficient f-tolerant algorithm implementing a wait-free MWMR atomic
register from a collection of $n > 2f$ servers, each storing a single CAS object. Unlike previous
space-efficient approaches, our algorithm does not require support for any specialized read-modify-write
functionality besides CAS, i.e., conditional write, obviating the need for custom server code. The algorithm's time
complexity is adaptive to concurrency, guaranteeing that each operation op terminates in at most $O(c^2)$ steps
where c = PntCont(op).

Our algorithm, called CAS-ABD, is derived from the multi-writer ABD [10] emulation of an atomic
read/write register, to which we refer as MW-ABD. For completeness, the MW-ABD implementation is
briefly reviewed in Section 4.1 below (full details can be found in [?]). The CAS-ABD algorithm is
described in Section 4.2.

4.1 MW-ABD Algorithm 

The MW-ABD shared state consists of a set $\mathcal{B}$ of $n > 2f$ crash-prone objects $\{b_1, \ldots, b_n\}$
mapped to a set $\mathcal{S}$ of n servers $\mathcal{S} = \{s_1, \ldots, s_n\}$ such that $\delta(b_i) = s_i$ for
each $1 \le i \le n$. Each object $b_i$ stores a pair $(ts, val)$ where ts is a timestamp and $val \in V$. We will
write $b_i.ts$ and $b_i.val$ to refer to the timestamp and value components of $b_i$ respectively. Each timestamp
ts is a pair $(num, c)$ where $num \in \mathbb{N}$ is a natural number, and $c \in \mathcal{C}$ is a client. We
will write $ts.num$ and $ts.c$ to refer to the first and second components of ts respectively. The timestamps are
ordered lexicographically so that $ts < ts'$ if $ts.num < ts'.num$, or $ts.num = ts'.num$ and $ts.c < ts'.c$.
The MW-ABD types and shared states are summarized in Algorithm 1.
Algorithm 1 Types and States of MW-ABD and CAS-ABD

1: $TS = \mathbb{N} \times \mathcal{C}$, the set of timestamps with selectors num and c
2: $TSVal = TS \times V$, with selectors ts and val

3: $\mathcal{B} = \{b_1, \ldots, b_n\}$: the set of shared objects such that $b_i \in TSVal$ for all $1 \le i \le n$; initially $b_i = ((0, 0), v_0)$
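
The types of Algorithm 1 map naturally onto tuples, whose built-in comparison in Python is already lexicographic. The sketch below is only an illustration: it assumes that clients are identified by integers, and the class names `TS` and `TSVal` and the constant `INITIAL` are chosen here to mirror the listing.

```python
from typing import Any, NamedTuple


class TS(NamedTuple):
    """Timestamp: a (num, c) pair, compared lexicographically."""
    num: int
    c: int  # client identifier; integers are an assumption of this sketch


class TSVal(NamedTuple):
    """Content of a base object: a timestamp together with a value."""
    ts: TS
    val: Any


# Initial content of every base object, with "v0" standing in for the initial value
INITIAL = TSVal(TS(0, 0), "v0")
```

Because `NamedTuple` instances compare like plain tuples, `TS(1, 2) < TS(2, 1)` holds by the first component, and `TS(1, 1) < TS(1, 2)` holds by the client identifier used as a tie-breaker.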


The sequential specification supported by each object $b_i \in \mathcal{B}$ is shown in Algorithm 2. It consists
of two atomic operations: read and update. The read operation returns the current content of $b_i$ (i.e.,
$(b_i.ts, b_i.val)$), and the update operation is a read-modify-write (RMW) primitive comprised of the atomically
executed sequence of steps shown in lines 2-5 of Algorithm 2. We henceforth refer to the object type
supporting the sequential specification in Algorithm 2 as ABD Object (ABDO).


Algorithm 2 The ABDO sequential specification for each $b_i$, $1 \le i \le n$

1: operation update(b_i, t, v)
2:   if b_i.ts < t
3:     b_i.ts ← t
4:     b_i.val ← v
5:   return ack
6: end

7: operation read(b_i)
8:   return (b_i.ts, b_i.val)
9: end
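
The sequential specification in Algorithm 2 can be rendered as a short Python class. This is a single-threaded sketch in which each method call stands for one atomic step; the class name `ABDObject` and the representation of timestamps as `(num, client)` tuples with integer client identifiers are assumptions of the example.

```python
class ABDObject:
    """Sequential ABDO: stores a (timestamp, value) pair."""

    def __init__(self, v0):
        self.ts = (0, 0)   # initial timestamp
        self.val = v0      # initial value

    def update(self, t, v):
        # Overwrite only if the new timestamp is strictly higher
        # (tuples compare lexicographically, as the timestamps do).
        if self.ts < t:
            self.ts = t
            self.val = v
        return "ack"

    def read(self):
        return (self.ts, self.val)
```

An update carrying a timestamp that is not higher than the stored one is acknowledged but leaves the content unchanged, so the stored timestamps never decrease.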


The implementation of both write and read proceeds by invoking consecutive rounds of base object
accesses. At each round, the client triggers operations on all base objects in parallel, and awaits responses
from at least $n - f$ objects. The write implementation consists of two rounds. In the first round, the writer
collects the set R of $(b_i.ts, b_i.val)$ pairs from $n - f$ objects by triggering $b_i.read$ on all objects
$b_i \in \mathcal{B}$. The writer then determines a new timestamp $ts'$ to be stored alongside the value v being
written so that $ts'.num = \max\{num' : (num', *) \in R\} + 1$ and $ts'.c$ is the writer's identifier. This is
followed by another round where the writer triggers $b_i.update(b_i, ts', v)$ on each base object $b_i$ to replace
its current content with $(ts', v)$.

The first round of read is identical to that of write except that the set R is used to identify the value
$v' \in V$ having the highest timestamp $ts'$ among the timestamp/value pairs in R. This is followed by another
round where the reader invokes $b_i.update(b_i, ts', v')$ on each base object $b_i$ to ensure $(ts', v')$ is available
from all sets of $n - f$ base objects. The reader then returns $v'$.
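
The two-round structure of MW-ABD can be sketched as a sequential Python simulation. It is a simplification under stated assumptions: operations are applied only to an explicitly given list of `responders` (indices of the objects that respond in both rounds), whereas the actual algorithm triggers operations on all objects and waits for any $n - f$ responses. The names `ABDObject`, `mw_abd_write`, `mw_abd_read`, and `responders`, as well as integer client identifiers, are introduced for this example.

```python
class ABDObject:
    """Sequential ABDO: update overwrites only with a strictly higher timestamp."""

    def __init__(self, v0):
        self.ts, self.val = (0, 0), v0

    def update(self, t, v):
        if self.ts < t:
            self.ts, self.val = t, v
        return "ack"

    def read(self):
        return (self.ts, self.val)


def _check_quorum(objects, f, responders):
    if len(set(responders)) < len(objects) - f:
        raise ValueError("need responses from at least n - f objects")


def mw_abd_write(objects, f, client_id, value, responders):
    _check_quorum(objects, f, responders)
    # Round 1: collect (ts, val) pairs and pick a higher timestamp.
    pairs = [objects[i].read() for i in responders]
    new_ts = (max(ts[0] for ts, _ in pairs) + 1, client_id)
    # Round 2: store the new pair.
    for i in responders:
        objects[i].update(new_ts, value)
    return new_ts


def mw_abd_read(objects, f, responders):
    _check_quorum(objects, f, responders)
    # Round 1: find the pair with the highest timestamp.
    pairs = [objects[i].read() for i in responders]
    ts, val = max(pairs, key=lambda pair: pair[0])
    # Round 2: write the pair back before returning its value.
    for i in responders:
        objects[i].update(ts, val)
    return val
```

With n = 3 and f = 1, any two sets of responders intersect in at least one object, which is why a read that contacts objects 1 and 2 still observes a value written through objects 0 and 1.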

4.2 CAS-ABD Algorithm 

Suppose that the base ABD objects in $\mathcal{B}$ are substituted with Compare-and-Swap (CAS) objects: i.e., the
sequential specification of each $b_i \in \mathcal{B}$ consists of a single CAS primitive whose code is shown in
lines 15-19 of Algorithm 3. We obtain an implementation of an f-tolerant MWMR atomic read/write register from
a collection of $n > 2f$ CAS base objects, to which we refer as CAS-ABD, in a modular fashion by first
constructing an ABDO from a single CAS base object $b_i$ using the emulation algorithm in Algorithm 3, and
then plugging the resulting construction into the MW-ABD algorithm described above.











Algorithm 3 The ABDO emulation from a single CAS object $b_i$, $1 \le i \le n$

Local variables:
    exp ∈ TSVal, initially ((0, 0), v_0)

1: operation update(b_i, t, v)
2:   done ← false
3:   if t > exp.ts
4:     repeat
5:       old ← CAS(b_i, exp, (t, v))
6:       if old = exp ∨ old.ts > t
7:         done ← true
8:       exp ← old
9:     until done = true
10:  return ack
11: end

12: operation read(b_i)
13:   return CAS(b_i, exp, exp)
14: end

15: operation CAS(b_i, exp, new), exp, new ∈ TSVal
16:   prev ← b_i
17:   if exp = b_i
18:     b_i ← new
19:   return prev
20: end
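
The emulation in Algorithm 3 can be transcribed into a short Python sketch. It is single-threaded, so each call to `cas` stands for one atomic step of the base object, and interleavings between clients are produced only by the order of calls. The class names `CASObject` and `ABDOFromCAS` are assumptions of the example; each `ABDOFromCAS` instance plays the role of one client's handle holding its local variable exp.

```python
class CASObject:
    """Base object supporting a single compare-and-swap primitive."""

    def __init__(self, initial):
        self.state = initial  # a (ts, val) pair

    def cas(self, exp, new):
        prev = self.state
        if exp == prev:
            self.state = new
        return prev


class ABDOFromCAS:
    """One client's view of an ABDO emulated from a single CAS object."""

    def __init__(self, base, initial):
        self.base = base
        self.exp = initial  # local variable exp of the listing

    def update(self, t, v):
        if t > self.exp[0]:
            done = False
            while not done:
                old = self.base.cas(self.exp, (t, v))
                # Stop on a successful swap, or when a higher
                # timestamp is already stored in the base object.
                if old == self.exp or old[0] > t:
                    done = True
                self.exp = old
        return "ack"

    def read(self):
        # CAS(exp, exp) never changes the state and returns the content.
        return self.base.cas(self.exp, self.exp)
```

In this sketch a client whose expectation is stale needs one failed CAS to learn the current content and a second CAS to install its pair, while an update carrying a timestamp lower than the stored one gives up after a single failed attempt.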


In order to prove that CAS-ABD is a correct implementation of an f-tolerant wait-free MWMR read/write
register, it suffices to show that the ABDO emulation in Algorithm 3 is a wait-free linearizable
implementation of the ABD object. Below we show that this is indeed the case assuming that the following property,
to which we henceforth refer as timestamp uniqueness, is satisfied in all runs r of ABDO: for all objects
$b_i \in \mathcal{B}$, r includes at most one invocation of the form $update(b_i, ts, *)$. Given that linearizability
is a composable property [?], and MW-ABD is known to satisfy timestamp uniqueness in all runs, the correctness
of CAS-ABD then follows from the correctness of MW-ABD [?].

To show linearizability [28], we first identify for each invocation of update and read in each possible
run of the ABDO emulation, a single step within the operation execution, called a linearization point (i.e., a
single step where the operation takes effect on the base object state), as follows: For each read invocation,
the linearization point is simply the return step in line 13. The linearization points for the update invocations
are assigned to either one of the following two steps: (1) if update returns without entering the loop in
lines 4-9, the condition test step in line 3 is the linearization point; and (2) if update returns due to the
condition in line 6 being true, then the CAS call in line 5 is the linearization point. The linearizability
then follows from the following lemma (proven in Section B of the Appendix), which asserts that the sequence
obtained by shrinking each operation to occur atomically at its linearization point is a valid sequential run
of ABDO.

Lemma 4.1 Let r be a run of the ABDO emulation in Algorithm 3 and σ be a sequential run obtained from
r by shrinking each update and read operation to occur at its linearization point. Then, σ is a sequential
run of the ABD object in Algorithm 2.

Since the read implementation is obviously wait-free, we only need to argue wait freedom for the
update operations. To see this, observe that t > exp.ts every time before CAS is called in line 5 (see
Lemma B.1 in Section B of the Appendix). Since bi.ts = exp.ts is a necessary condition for a successful
CAS call, the value of bi can only be changed when t > bi.ts. Hence, the timestamps of the values stored in
each bi are non-decreasing (see Lemma B.2 in Section B of the Appendix). If bi.ts does not change between
consecutive iterations of the loop in lines 4-9, timestamp uniqueness implies that the next call to CAS will
be successful and the loop terminates. Otherwise, the fact that the timestamps are non-decreasing implies
that bi.ts is superseded by a higher timestamp. Since there are only finitely many timestamps lower than t,
the loop will terminate no later than the value of bi.ts reaches or exceeds t. Thus, we have the following
result (see Section B of the Appendix for the full proof):

Lemma 4.2 The ABDO emulation in Algorithm 3 is wait-free provided all its runs satisfy timestamp uniqueness.
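The termination argument can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not part of the original algorithm description: the helper names cas and update_with_adversary are hypothetical, the base object is a one-element list, and an adversary installs one competing value (with strictly increasing, unique timestamps) before each CAS attempt, modelled as a direct write.

```python
INITIAL = ((0, 0), None)  # ((num, client id), value); None stands in for the initial value


def cas(cell, exp, new):
    """Compare-and-swap on a one-element list; returns the previous content."""
    prev = cell[0]
    if prev == exp:
        cell[0] = new
    return prev


def update_with_adversary(cell, t, v, interfering):
    """Run the update loop while an adversary installs the next interfering
    value (a competing successful CAS, modelled as a direct write) before
    every attempt. Returns the number of CAS calls issued by the update."""
    exp = INITIAL
    pending = list(interfering)
    attempts = 0
    if t > exp[0]:
        while True:
            if pending:
                cell[0] = pending.pop(0)
            old = cas(cell, exp, (t, v))
            attempts += 1
            done = old == exp or old[0] >= t
            exp = old
            if done:
                break
    return attempts


# Three competing writes with timestamps below t = (5, 1): three failed
# attempts, then a successful one once the object stops changing.
cell = [INITIAL]
print(update_with_adversary(cell, (5, 1), "v",
                            [((1, 2), "a"), ((2, 2), "b"), ((3, 2), "c")]))
```

Each failed attempt observes a strictly higher timestamp than the previous one, so the number of attempts is bounded by the number of distinct timestamps below t plus one; a competing timestamp at or above t ends the loop immediately.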

Given that timestamp uniqueness holds in all runs of MW-ABD, we obtain the following:

Theorem 4.3 The CAS-ABD algorithm is an f-tolerant implementation of a wait-free MWMR atomic register.

Time Complexity It is easy to see that in the absence of contention, the update operation terminates
in at most 2 rounds of base object accesses. This can be further optimized if the clients keep a local
copy of the most recent value read from each object bi at the read round of CAS-ABD, and then use this
value to initialize the expected value parameter exp of CAS. Thus, in the best case, when the object
replies are received in a timely fashion and there is no contention, update will terminate in just 1 round,
thus achieving the 2 round complexity of MW-ABD overall.
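The effect of this optimization can be seen in a short Python sketch that counts CAS calls. It is illustrative only: the names cas and update_rounds are hypothetical, the base object is modelled as a one-element list, and "rounds" are approximated by the number of CAS calls issued on a single object.

```python
INITIAL = ((0, 0), None)  # ((num, client id), value)


def cas(cell, exp, new):
    """Compare-and-swap on a one-element list; returns the previous content."""
    prev = cell[0]
    if prev == exp:
        cell[0] = new
    return prev


def update_rounds(cell, exp, t, v):
    """Run update starting from the given local `exp`; return the number of
    CAS calls issued on this object."""
    rounds = 0
    if t > exp[0]:
        while True:
            old = cas(cell, exp, (t, v))
            rounds += 1
            done = old == exp or old[0] >= t
            exp = old
            if done:
                break
    return rounds


cell = [((3, 1), "a")]
# Stale exp: the first CAS fails and only reveals the content, the second succeeds.
print(update_rounds(cell, INITIAL, (4, 2), "b"))
# exp seeded from the value seen in the read round: the first CAS succeeds.
print(update_rounds(cell, cell[0], (5, 2), "c"))
```

With a stale exp the first CAS acts as a read; seeding exp with the value obtained in the read round removes that extra access when no other client intervenes.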

In the presence of contention, the number of unsuccessful CAS calls executed within the update operation
loop in lines 4-9 is bounded by the number of unique timestamps returned by the CAS calls that are smaller
than the timestamp t supplied to the update. Given the way the timestamps are chosen by the algorithm,
the number of such timestamps per each of the c concurrently executing clients is constant. However, since
the num component of each timestamp can be shared by concurrently executing clients, the overall time
complexity of update can be as high as c^2. In Section B of the Appendix, we prove that c is equal to the
maximum number of clients that can execute concurrently with the update, thus obtaining the following:

Theorem 4.4 The CAS-ABD time complexity is adaptive to concurrency, guaranteeing that each operation
op terminates in at most O(c^2) base object accesses where c = PntCont(op).


5 Conclusions and Future Work 

We studied the resource complexity of emulating an f-tolerant read/write MWMR register from a collection
of atomic MWMR registers stored on crash-prone servers. We established a number of lower bounds that
apply to any fault-tolerant emulation of an MWMR register that satisfies weak correctness guarantees:
single-writer safety and solo termination. In particular, we proved that no such emulation can use fewer
than kf registers to support k > 0 clients or have its storage consumption adaptive to concurrency. We also
characterized possible allocations of registers to servers by showing that if the number of registers per server
is bounded by a known constant m, then supporting km clients requires f + 1 more servers in addition to
the requisite kf servers implied by our storage bound.

In search of a simple RMW primitive that can be leveraged for obtaining a space-efficient implementation,
we studied reliable storage emulations from crash-prone CAS objects. To this end, we presented
a constant space emulation of an MWMR atomic read/write register that utilizes a single CAS object per
server, tolerates up to a minority of server crashes, and has time complexity adaptive to point contention.

Our work leaves some questions open for future work. First, observe that ABD can be applied in a
straightforward fashion to implement an MWMR wait-free atomic register from fault-prone registers by
assigning each client to a dedicated set of 2f + 1 registers stored on 2f + 1 different servers. An interesting
open question is then whether our lower bound can be further tightened to match this storage cost, or whether
there are emulations that can achieve a tighter storage cost (e.g., by weakening their correctness guarantees).
Second, the worst-case time complexity of our CAS-based ABD implementation is quadratic in point contention.
It will be interesting to explore whether it can be further improved (e.g., by modifying the ABD timestamp
selection mechanism), or whether this is an inherent limitation.
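To make the gap in the first open question concrete, the following Python sketch evaluates the two storage figures mentioned above for given k and f. The function names are hypothetical and serve only as a worked arithmetic example; the formulas are the kf lower bound and the k(2f + 1) cost of assigning each client a dedicated set of registers.

```python
def register_lower_bound(k: int, f: int) -> int:
    """Lower bound on the number of base registers for k clients: k * f."""
    return k * f


def dedicated_abd_registers(k: int, f: int) -> int:
    """Registers used when each of k clients gets its own 2f + 1 registers."""
    return k * (2 * f + 1)


def storage_gap(k: int, f: int) -> int:
    """Difference between the two figures above, i.e. k * (f + 1)."""
    return dedicated_abd_registers(k, f) - register_lower_bound(k, f)


# Example: k = 4 clients, f = 2 tolerated crashes.
print(register_lower_bound(4, 2), dedicated_abd_registers(4, 2), storage_gap(4, 2))
```

For k = 4 and f = 2 the lower bound is 8 registers while the dedicated-register construction uses 20, so the open gap in this instance is 12 registers.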

A Space Lower Bounds


Theorem A.1 For any k > 0, f > 0, there is no f-tolerant algorithm emulating an SW-safe solo-terminating
MWSR register for k clients using fewer than kf base registers.

Proof: Pick arbitrary k > 0, f > 0, and assume by contradiction that there exists an f-tolerant algorithm
A that emulates an SW-safe solo-terminating MWSR register for k clients with fewer than kf base registers.
By Lemma 3.1, there exists a run r of A consisting of k high-level writes by k distinct clients such that
by the end of r, the number of distinct base registers having a covering write is at least kf. Hence, A
requires at least kf distinct base registers to support k clients. A contradiction. □


Theorem A.2 For any k > 0 and f > 0, there is no f-tolerant algorithm emulating an SW-safe solo-terminating
MWSR register for k clients with fewer than 2f + 1 servers.

Proof: Assume by contradiction that there exists an f-tolerant algorithm A emulating an SW-safe solo-terminating
MWSR register for k > 0 clients using 2f servers. By Corollary 3.6, there exists a run r_1 of
A consisting of a single high-level write W_1 by a client c_1 such that |S \ δ(Cov(t_0))| > 2f where t_0 = 0.
Since no base registers are covered at t_0, |S \ δ(Cov(t_0))| = |S| > 2f. However, by assumption, |S| = 2f.
A contradiction. □


B Correctness of CAS-ABD 

We first argue that our emulation is a linearizable implementation of ABDO. The argument relies on the 
following auxiliary invariants. 

Lemma B.1 If line 5 is reached, then t > exp.ts.

Proof: The proof is by induction on the number of iterations of the loop in lines 4-9. For the base case,
note that by line 3, t > exp.ts is the necessary condition for entering the loop. Hence, the lemma holds the
first time line 5 is reached. Next, assume that the result is true for iteration k ≥ 1, and consider iteration
k + 1. Since iteration k + 1 is reached, the condition in line 6 must be false at iteration k, that is, old.ts < t.
By line 8, at the beginning of iteration k + 1, exp = old, and therefore, exp.ts = old.ts < t as needed. □

We now show that bi.ts is non-decreasing: 

Lemma B.2 Let bi.ts_1 and bi.ts_2 be the values of bi.ts at times t_1 and t_2 respectively. If t_1 < t_2, then
bi.ts_1 ≤ bi.ts_2.

Proof: Observe that bi.ts can only change as a result of a successful CAS invocation in line 5. The
necessary condition for that to happen is exp = bi in line 5. By Lemma B.1, t > exp.ts = bi.ts. Hence, the
value of bi.ts is either left unchanged, or increases as needed. □
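The invariant of Lemma B.2 can be checked empirically with a small randomized Python sketch. It is only a sanity check under stated assumptions, not a proof: update is modelled as a generator that yields after every CAS so that a random scheduler can interleave several clients at CAS granularity, timestamps are unique (num, client id) pairs, and the names update_steps and run_random_schedule are hypothetical.

```python
import random

INITIAL = ((0, 0), None)  # ((num, client id), value)


def update_steps(cell, t, v):
    """Generator form of update: performs one CAS per step and yields after it,
    so that a scheduler can interleave the CAS calls of different clients."""
    exp = INITIAL
    if t > exp[0]:
        while True:
            old = cell[0]
            if old == exp:            # CAS succeeds
                cell[0] = (t, v)
            yield
            if old == exp or old[0] >= t:
                return
            exp = old


def run_random_schedule(num_clients, seed):
    """Interleave one update per client at random; return the sequence of
    timestamps stored in the object after every step."""
    rng = random.Random(seed)
    cell = [INITIAL]
    pending = [update_steps(cell, (rng.randint(1, 5), cid), f"v{cid}")
               for cid in range(1, num_clients + 1)]
    history = [cell[0][0]]
    while pending:
        g = rng.choice(pending)
        try:
            next(g)
        except StopIteration:
            pending.remove(g)
        history.append(cell[0][0])
    return history


h = run_random_schedule(4, seed=1)
print(all(a <= b for a, b in zip(h, h[1:])))
```

In every schedule generated this way the recorded timestamps form a non-decreasing sequence, as the lemma states.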

Next, we show linearizability: 

Lemma B.3 Let r be a run of the ABDO emulation in Algorithm 3 and σ be a sequential run obtained from
r by shrinking each update and read operation to occur at its linearization point. Then, σ is a sequential
run of the ABD object in Algorithm 2.

Proof: Let t_1, t_2, ... such that t_i < t_{i+1}, i ≥ 1, denote the times at which the linearization points occur
in r. The proof is by induction on i. For the base case, consider the first linearization point t_1. If t_1 is the
linearization point of read, then its return value is ((0, 0), v0); and if t_1 is the linearization point of update,
then its return value is ack. Since both return values are identical to those produced by the read and update
of the ABD object if invoked at the initial state, the result holds.

Next, assume that the result is true for the first k - 1 linearization points, and consider the kth linearization
point t_k. If t_k is the linearization point of update, then its return value is ack, which is consistent with
the sequential specification of the ABD object.

Suppose that t_k is the linearization point of a read operation. Suppose first that the linearization point t_{k-1}
is associated with a read. Since for any value of exp, CAS(bi, exp, exp) does not change the content of bi,
the return value of the read will be the same as that of the read linearized at t_{k-1}, which complies with the
sequential specification of the ABD object.

Next, suppose that the operation linearized at t_{k-1} is an update operation u = update(bi, t, v) for some
t ∈ TS and v ∈ V. Let x_j denote the value of variable x at time t_j. The sequential specification of the ABD
object requires the read to return (t, v) if t > bi.ts_{k-2}, and b_{i,k-2} otherwise. We show that this is indeed
the case.

First, suppose that t > bi.ts_{k-2}. Since no linearization points occur between t_{k-2} and t_{k-1}, and bi
can only be changed at a linearization point, at line 3, exp.ts_{k-1} ≤ bi.ts_{k-2} = bi.ts_{k-1} < t. Hence,
linearization point t_{k-1} must occur at line 5. This means that CAS in line 5 is successful, as otherwise
old.ts_{k-1} ≥ t implies that old.ts_{k-1} = bi.ts_{k-1} = bi.ts_{k-2} ≥ t, contradicting the assumption. Therefore,
the linearization point t_{k-1} coincides with a successful CAS in line 5 so that b_{i,k-1} = (t, v). Since no
linearization points occur between t_{k-1} and t_k, and bi can only be changed at a linearization point, b_{i,k-1} =
b_{i,k} = (t, v). Hence, the read will return (t, v) as needed.

Finally, suppose that t ≤ bi.ts_{k-2}. If t ≤ exp.ts, then linearization point t_{k-1} occurs at line 3, and
therefore, u returns without changing bi. Hence, b_{i,k-1} = b_{i,k-2}. Suppose t > exp.ts, and consider the
CAS invocation occurring at the first iteration of the loop in lines 4-9. Observe that this invocation must be
unsuccessful as otherwise, bi.ts_{k-2} = exp.ts < t, contradicting the assumption that t ≤ bi.ts_{k-2}. At the
same time, old.ts = bi.ts_{k-2} ≥ t. Hence, the condition in line 6 is true, which implies that u leaves the
loop without changing the value of b_{i,k-2} at t_{k-1}. We conclude that b_{i,k-1} = b_{i,k-2}. Thus, the read will
return b_{i,k-2} as required. □
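The statement of Lemma B.3 can be exercised on sequential runs with a short Python sketch that compares the CAS-based emulation against a sequential ABD object. The sketch rests on assumptions: the sequential specification is inferred from the proof above (update installs (t, v) only if t exceeds the stored timestamp, read returns the stored pair), the class and function names are hypothetical, and only sequential runs with unique timestamps are generated, so concurrency is not covered.

```python
import random

INITIAL = ((0, 0), None)  # ((num, client id), value)


class SpecObject:
    """Sequential ABD object as inferred from the proof (assumption)."""

    def __init__(self):
        self.state = INITIAL

    def update(self, t, v):
        if t > self.state[0]:
            self.state = (t, v)
        return "ack"

    def read(self):
        return self.state


class Emulated:
    """One CAS cell plus a local `exp` per client."""

    def __init__(self, clients):
        self.cell = INITIAL
        self.exp = {c: INITIAL for c in clients}

    def _cas(self, exp, new):
        prev = self.cell
        if prev == exp:
            self.cell = new
        return prev

    def update(self, client, t, v):
        exp = self.exp[client]
        if t > exp[0]:
            while True:
                old = self._cas(exp, (t, v))
                done = old == exp or old[0] >= t
                exp = old
                if done:
                    break
        self.exp[client] = exp
        return "ack"

    def read(self, client):
        exp = self.exp[client]
        return self._cas(exp, exp)


def check_sequential_run(seed, steps=30):
    """Apply the same random sequential run to both objects and compare."""
    rng = random.Random(seed)
    clients = [1, 2, 3]
    spec, emu = SpecObject(), Emulated(clients)
    for num in rng.sample(range(1, 100), steps):   # distinct nums: unique timestamps
        c = rng.choice(clients)
        if rng.random() < 0.5:
            if emu.update(c, (num, c), num) != spec.update((num, c), num):
                return False
        if emu.read(c) != spec.read():
            return False
    return True
```

A call such as check_sequential_run(0) returns True when every read of the emulation agrees with the sequential object on that run.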

We next show that the ABDO emulation is wait-free if all its runs satisfy timestamp uniqueness. 

Lemma B.4 The ABDO emulation in Algorithm 3 is wait-free provided all its runs satisfy timestamp uniqueness.

Proof: Since the read operation is obviously wait-free, we only need to show that the update operation is 
wait-free as well. 

Consider an update invocation u = update(bi, t, v). If the condition in line 3 is false, then u returns,
and we are done. Otherwise, let ts_j, j ≥ 1, be the value of bi.ts before CAS is invoked at the jth iteration
of the loop in lines 4-9. At all iterations j ≥ 1, if ts_j ≥ t, then the condition in line 6 is true, and the loop
terminates. Otherwise, by Lemma B.2 and timestamp uniqueness, ts_{j+1} > ts_j. Since there are only finitely
many timestamps between ts_1 and t, there exists an iteration where the condition in line 6 is satisfied, and
the loop terminates. □

Given that timestamp uniqueness holds in all runs of MW-ABD, we obtain the following:

Theorem B.5 The CAS-ABD algorithm is an f-tolerant implementation of a wait-free MWMR atomic register.

Lemma B.6 Let op be an operation that invokes update at time t and let op' be another operation that
starts at time t' ≥ t. If k operations are invoked but do not complete before time t, then ts(op').num ≥
ts(op).num - k - 1.

Proof: Let op'' be the operation with the highest timestamp that returns before time t. By the MW-ABD
timestamp selection mechanism, ts(op') > ts(op''). Therefore it is sufficient to prove that ts(op'').num ≥
ts(op).num - k - 1. Suppose, for the purpose of contradiction, that ts(op'').num = ts(op).num - k - 2.
Since every operation increments num by at most one and the timestamp of op is ts(op), at least k + 1
operations must be invoked before time t with timestamps strictly greater than ts(op).num - k - 2. At least
one of these operations returns before time t by the statement of our lemma. This is a contradiction since
op'' was chosen to be the operation with the highest timestamp that returns before t. □

Lemma B.7 Let op be an operation that invokes update at time t and op' be another operation that obstructs
op on some object bi but is not one of the first two operations to obstruct op on bi. Then op' does not
complete by time t.

Proof: Since op is obstructed at least three times, the following sequence of invocations on bi must occur
(op.bi.CAS denotes the invocation of CAS on register bi during high-level operation op):
op.bi.CAS ... op.bi.CAS ... op.bi.CAS. Since all three invocations of op.bi.CAS fail (the third one due
to op'), we know that there are at least three invocations of bi.CAS by other operations that succeed:
op'''.bi.CAS ... op.bi.CAS ... op''.bi.CAS ... op.bi.CAS ... op'.bi.CAS ... op.bi.CAS.

Since op'.bi.CAS succeeds in updating bi, it learns the value written by op''.bi.CAS, which happens after
the first invocation of op.bi.CAS, which in turn must occur after update is invoked during op, i.e., after time
t. Hence, op' does not complete by time t. □

Lemma B.8 Let op be an operation. For any constant n, the number of operations op' that are concurrent
with op and such that ts(op').num = n is at most PntCont(op).

Proof: Suppose for the sake of contradiction that there exists a constant n such that there are PntCont(op) +
1 operations concurrent with op with the first component of their timestamp equal to n. Since there are
PntCont(op) + 1 operations and at most PntCont(op) clients executing operations concurrently with op at
any single point in time (by definition of point contention), there is a client that executes two operations, both
of which have the same first component of the timestamp. However, since each client executes operations
sequentially, the MW-ABD timestamp selection mechanism guarantees that the num component of the second
operation's timestamp will be greater than that of the first one. A contradiction. □

Theorem B.9 (Time Complexity) The CAS-ABD time complexity is adaptive to concurrency, guaranteeing
that each operation op terminates in at most O(c^2) base object accesses where c = PntCont(op).

Proof: Let t be the time when op invokes update. There are three types of operations that can obstruct op:
(1) an operation that completes before time t; (2) an operation that starts but does not complete before time
t; and (3) an operation invoked at time t or later. We next quantify the number of operations of each type
that can obstruct op.

By Lemma B.7, at most two operations completing before time t can obstruct op on a given register. Thus,
at most two operations fall into the first category. By definition of PntCont(op), the number of operations
of the second type is at most PntCont(op). By Lemma B.6, this also implies that any operation op' of the
third type, that is, starting at time t or later, satisfies ts(op).num - ts(op').num ≤ PntCont(op) + 1. Since
operations with timestamps higher than ts(op) cannot obstruct op (see line 6), we only care about the case
0 ≤ ts(op).num - ts(op').num. There are at most PntCont(op) + 2 numbers in this range. Since all
operations that start at time t or later and obstruct op are concurrent with op, by Lemma B.8 there are at
most PntCont(op) such operations whose first timestamp component is each of the numbers in the range
described above. Overall, there are at most (PntCont(op) + 2) * PntCont(op) operations with timestamps in
this range, and in total there are PntCont(op)^2 + 3PntCont(op) + 2 operations that may obstruct op.

Notice that an operation op' can obstruct op on an object bi only by changing the value of bi using CAS
on line 5. By the specification of CAS, the old value of bi was the expected value passed to CAS in this
invocation during op'. By the conditions on lines 6 and 9, once this CAS returns, update terminates, and
op' returns. This means that op' can obstruct op at most once. Since each operation can obstruct op at most
once, PntCont(op)^2 + 3PntCont(op) + 2 is an upper bound on the number of times a CAS invocation during
op can fail (for each object). □
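For reference, the count used in the proof can be written out as a single expression. The shorthand P = PntCont(op) is introduced here only for readability, and the three terms correspond to the three types of obstructing operations:

```latex
\[
\underbrace{2}_{\text{type (1)}}
\;+\; \underbrace{P}_{\text{type (2)}}
\;+\; \underbrace{(P+2)\,P}_{\text{type (3)}}
\;=\; P^2 + 3P + 2
\;=\; (P+1)(P+2)
\;=\; O(P^2),
\qquad P = \mathit{PntCont}(op).
\]
% Worked instance: P = 3 gives 2 + 3 + 5 \cdot 3 = 20 = 4 \cdot 5.
```

For example, with point contention 3 an update can observe at most 20 failed CAS calls on any single object under this bound.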